AITopics | numa node

Collaborating Authors

numa node

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SySCD: A System-Aware Parallel Coordinate Descent Algorithm

Nikolas Ioannou, Celestine Mendler-Dünner, Thomas Parnell

Neural Information Processing SystemsFeb-12-2026, 21:47:52 GMT

Neural Information Processing Systems http://nips.cc/

algorithm, implementation, numa node, (14 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Add feedback

SySCD: A System-Aware Parallel Coordinate Descent Algorithm

Nikolas Ioannou, Celestine Mendler-Dünner, Thomas Parnell

Neural Information Processing SystemsOct-9-2025, 14:49:51 GMT

Today's individual machines offer dozens of cores and hundreds of gigabytes of RAM that can, if used efficiently, significantly improve the training performance of machine learning models.

artificial intelligence, implementation, machine learning, (15 more...)

Neural Information Processing Systems

Country:

Europe (0.46)
North America > United States (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Add feedback

Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving

Zhao, Juntao, Li, Jiuru, Wu, Chuan

arXiv.org Artificial IntelligenceJul-25-2025

Utilizing CPUs to serve large language models (LLMs) is a resource-friendly alternative to GPU serving. Existing CPU-based solutions ignore workload differences between the prefill and the decode phases of LLM inference, applying a static per-NUMA (Non-Uniform Memory Access) node model partition and utilizing vendor libraries for operator-level execution, which is suboptimal. We propose Sandwich, a hardware-centric CPU-based LLM serving engine that uses different execution plans for the prefill and decode phases and optimizes them separately. We evaluate Sandwich across diverse baselines and datasets on five CPU platforms, including x86 with AVX-2 and AVX-512, as well as ARM with NEON. Sandwich achieves an average 2.01x throughput improvement and 90% satisfactory time-to-first-token (TTFT) and time-per-output-token (TPOT) latencies with up to 3.40x lower requirements in single sequence serving, and significant improvement in Goodput in continuous-batching serving. The GEMM kernels generated by Sandwich outperform representative vendor kernels and other dynamic shape solutions, achieving performance comparable to static compilers with three orders of magnitude less kernel tuning costs.

large language model, machine learning, service configuration, (18 more...)

arXiv.org Artificial Intelligence

2507.18454

Country:

Europe (0.93)
North America > United States > California (0.67)

Genre:

Workflow (0.88)
Research Report > New Finding (0.46)

Industry: Telecommunications (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Symmetry-Preserving Architecture for Multi-NUMA Environments (SPANE): A Deep Reinforcement Learning Approach for Dynamic VM Scheduling

Chan, Tin Ping, Cheng, Yunlong, Zhu, Yizhan, Gao, Xiaofeng, Chen, Guihai

arXiv.org Artificial IntelligenceApr-22-2025

As cloud computing continues to evolve, the adoption of multi-NUMA (Non-Uniform Memory Access) architecture by cloud service providers has introduced new challenges in virtual machine (VM) scheduling. To address these challenges and more accurately reflect the complexities faced by modern cloud environments, we introduce the Dynamic VM Allocation problem in Multi-NUMA PM (DVAMP). We formally define both offline and online versions of DVAMP as mixed-integer linear programming problems, providing a rigorous mathematical foundation for analysis. A tight performance bound for greedy online algorithms is derived, offering insights into the worst-case optimality gap as a function of the number of physical machines and VM lifetime variability. To address the challenges posed by DVAMP, we propose SPANE (Symmetry-Preserving Architecture for Multi-NUMA Environments), a novel deep reinforcement learning approach that exploits the problem's inherent symmetries. SPANE produces invariant results under arbitrary permutations of physical machine states, enhancing learning efficiency and solution quality. Extensive experiments conducted on the Huawei-East-1 dataset demonstrate that SPANE outperforms existing baselines, reducing average VM wait time by 45%. Our work contributes to the field of cloud resource management by providing both theoretical insights and practical solutions for VM scheduling in multi-NUMA environments, addressing a critical gap in the literature and offering improved performance for real-world cloud systems.

machine learning, numa node, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

2504.14946

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Industry: Information Technology > Services (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics

Rayhan, Yeasir, Aref, Walid G.

arXiv.org Artificial IntelligenceNov-5-2024

Ever since the Dennard scaling broke down in the early 2000s and the frequency of the CPU stalled, vendors have started to increase the core count in each CPU chip at the expense of introducing heterogeneity, thus ushering the era of NUMA processors. Since then, the heterogeneity in the design space of hardware has only increased to the point that DBMS performance may vary significantly up to an order of magnitude in modern servers. An important factor that affects performance includes the location of the logical cores where the DBMS queries are scheduled, and the locations of the data that the queries access. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution to certain logical cores, and places data accordingly to certain integrated memory controllers (IMC), to integrate hardware consciousness into the system. In the spirit of hardware-software synergy, P-MOSS solely guides its scheduling decision based on low-level hardware statistics collected by performance monitoring counters with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B-tree and R-tree indexes. Performance results demonstrate that P-MOSS has up to 6x improvement over traditional schedules in terms of query throughput.

index slice, p-moss, scheduling policy, (14 more...)

arXiv.org Artificial Intelligence

2411.02933

Country:

North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)

Genre: Research Report (0.70)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.66)

Add feedback

Inference Acceleration for Large Language Models on CPUs

PS, Ditto, VG, Jithin, MS, Adarsh

arXiv.org Artificial IntelligenceMar-4-2024

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions to handle the computational demands. In this paper, we explore the utilization of CPUs for accelerating the inference of large language models. Specifically, we introduce a parallelized approach to enhance throughput by 1) Exploiting the parallel processing capabilities of modern CPU architectures, 2) Batching the inference request. Our evaluation shows the accelerated inference engine gives an 18-22x improvement in the generated token per sec. The improvement is more with longer sequence and larger models. In addition to this, we can also run multiple workers in the same machine with NUMA node isolation to further improvement in tokens/s. Table 2, we have received 4x additional improvement with 4 workers. This would also make Gen-AI based products and companies environment friendly, our estimates shows that CPU usage for Inference could reduce the power consumption of LLMs by 48.9% while providing production ready throughput and latency.

gen intel xeon scalable processor, intel xeon scalable processor, utilization, (11 more...)

arXiv.org Artificial Intelligence

2406.07553

Genre: Research Report (0.43)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

SySCD: A System-Aware Parallel Coordinate Descent Algorithm

Ioannou, Nikolas, Mendler-Dünner, Celestine, Parnell, Thomas

arXiv.org Machine LearningNov-18-2019

In this paper we propose a novel parallel stochastic coordinate descent (SCD) algorithm with convergence guarantees that exhibits strong scalability. We start by studying a state-of-the-art parallel implementation of SCD and identify scalability as well as system-level performance bottlenecks of the respective implementation. We then take a principled approach to develop a new SCD variant which is designed to avoid the identified system bottlenecks, such as limited scaling due to coherence traffic of model sharing across threads, and inefficient CPU cache accesses. Our proposed system-aware parallel coordinate descent algorithm (SySCD) scales to many cores and across numa nodes, and offers a consistent bottom line speedup in training time of up to x12 compared to an optimized asynchronous parallel SCD algorithm and up to x42, compared to state-of-the-art GLM solvers (scikit-learn, Vowpal Wabbit, and H2O) on a range of datasets and multi-core CPU architectures.

dataset, implementation, numa node, (13 more...)

arXiv.org Machine Learning

1911.07722

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Add feedback

Parallel training of linear models without compromising convergence

Ioannou, Nikolas, Dünner, Celestine, Kourtis, Kornilios, Parnell, Thomas

arXiv.org Machine LearningNov-5-2018

In this paper we analyze, evaluate, and improve the performance of training generalized linear models on modern CPUs. We start with a state-of-the-art asynchronous parallel training algorithm, identify system-level performance bottlenecks, and apply optimizations that improve data parallelism, cache line locality, and cache line prefetching of the algorithm. These modifications reduce the per-epoch run-time significantly, but take a toll on algorithm convergence in terms of the required number of epochs. To alleviate these shortcomings of our systems-optimized version, we propose a novel, dynamic data partitioning scheme across threads which allows us to approach the convergence of the sequential version. The combined set of optimizations result in a consistent bottom line speedup in convergence of up to $\times12$ compared to the initial asynchronous parallel training algorithm and up to $\times42$, compared to state of the art implementations (scikit-learn and h2o) on a range of multi-core CPU architectures.

artificial intelligence, dataset, machine learning, (18 more...)

arXiv.org Machine Learning

1811.01564

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)

Add feedback

Optimization tips and tricks on Azure SQL Server for Machine Learning Services

#artificialintelligenceMay-8-2017, 16:45:19 GMT

Since SQL Server 2016, a new function called R Services has been introduced. Microsoft recently announced a preview for the next version of SQL Server, which extends the advanced analytical ability to Python. This new capability of running R or Python in-database at scale enables us to keep the analytics services close to the data and eliminates the burden of data movements. To get the most out of SQL server, knowing how to fine tune the intelligence model itself is far from sufficient and sometimes still fail to meet the performance requirement. There are quite a few optimization tips and tricks that could help us boost the performance significantly.

artificial intelligence, machine learning, sql server, (14 more...)

#artificialintelligence

Industry: Education (0.77)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

DimmWitted: A Study of Main-Memory Statistical Analytics

Zhang, Ce, Ré, Christopher

arXiv.org Machine LearningJul-7-2014

We perform the first study of the tradeoff space of access methods and replication to support statistical analytics using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical analytics systems differ from conventional SQL-analytics in the amount and types of memory incoherence they can tolerate. Our goal is to understand tradeoffs in accessing the data in row- or column-order and at what granularity one should share the model and data for a statistical task. We study this new tradeoff space, and discover there are tradeoffs between hardware and statistical efficiency. We argue that our tradeoff study may provide valuable information for designers of analytics engines: for each system we consider, our prototype engine can run at least one popular task at least 100x faster. We conduct our study across five architectures using popular models including SVMs, logistic regression, Gibbs sampling, and neural networks.

artificial intelligence, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

1403.755

Country: North America > United States (1.00)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Industry: Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

Add feedback